#voice ai

DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

2026-03-09 Shangeth Rajaa Interspeech 2026 (Accepted)

#Voice AI #Turn-Taking #Spoken Dialogue #Speech LLM

Dual-channel generative pretraining for learning natural turn-taking in spoken dialogue without labeled data. A 0.5B-parameter model that outperforms models 6x its size on turn prediction.


Speech LLMs for Conversations

2024-05-09

#Voice AI #Speech LLM #Conversational AI

A multimodal speech LLM that processes audio directly to enhance conversational AI while reducing overhead compared to traditional ASR-LLM-TTS pipelines.


Improving End-to-End SLU Performance with Prosodic Attention and Distillation

2023-08-20 Shangeth Rajaa Interspeech 2023, pp. 1114–1118

#Voice AI #Spoken Language Understanding #Prosody #Speech

Two techniques for incorporating prosody into end-to-end SLU: prosody-attention and prosody-distillation. Together they improve intent classification accuracy on SLURP by up to 8%.


Improving Spoken Language Identification with Map-Mix

2023-06-04 Shangeth Rajaa, Kriti Anandan, Swaraj Dalmia, Tarun Gupta, Eng Siong Chng ICASSP 2023 — IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1–5

#Voice AI #Speech #Language Identification #Data Augmentation

Map-Mix: a data augmentation approach using model training dynamics to guide latent mixup sampling, giving ~2% weighted F1 improvement on low-resource dialect classification.


Skit-S2I: An Indian Accented Speech to Intent Dataset

2022-12-26 Shangeth Rajaa, Swaraj Dalmia, Kumarmanas Nethil arXiv preprint arXiv:2212.13015

#Voice AI #Spoken Language Understanding #Dataset #Speech

The first public Indian-accented SLU dataset in the banking domain. Self-supervised (SSL) speech representations beat ASR-based approaches for intent classification.


Feature Disentanglement - I

2022-02-22

#Voice AI #Speech Representation #Deep Learning

How deep learning models can isolate independent factors of variation in data through VAEs and Beta-TCVAE, enabling controlled synthesis and better downstream representations.


Learning Speaker Representation with Semi-supervised Learning Approach for Speaker Profiling

2021-10-24 Shangeth Rajaa, Pham Van Tung, Chng Eng Siong arXiv preprint arXiv:2110.13653

#Voice AI #Speaker Profiling #Speech Representation #Semi-supervised Learning

A semi-supervised framework for speaker profiling that leverages external unlabelled corpora via supervised, unsupervised, and consistency training, achieving an RMSE of 6.8 years on age estimation.
